Data Science without the Data

Rhian Davies | @statsRhian

About Me 👋

  • Data Scientist at Jumping Rivers
  • RSS Statistical Ambassador
  • Bad at French (Je suis désolé 😳)

Cartoon of a woman holding out a book

About Jumping Rivers

  • Data science & machine learning
  • Training courses
  • Dashboard development and deployment
  • Infrastructure
  • Managed Posit services

Cartoon of three people working at computers

I’m going to tell you a story

The Client

  • Database of patients with a rare disease
  • Consulted us to perform the data analysis for a study
    • 200 statistical results (count, %, mean, sd, median, IQR)
    • Interrupted Time Series Analysis

A cartoon robot holding a test tube and wearing a lab coat

Stratifications

  • Country
  • Subtypes of the disease
  • Mobility
  • Drug
  • Year
  • Age of patient

For example

  • What is the average age of patients when they are diagnosed (by country and subtype)?

  • What percentage of patients are taking Drug A (by country, subtype and year)?

Simple, yes?

data |>
  group_by(country, subtype) |>
  summarise(mean = mean(age_at_diagnosis))
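
The second question follows the same pattern. A minimal sketch on toy data, assuming hypothetical column names (`drug`, `year`) and assuming that patients with a missing drug stay in the denominator:

```r
library(dplyr)

# Toy data for illustration only; column names are assumptions
patients <- data.frame(
  country = c("FR", "FR", "FR", "UK", "UK"),
  subtype = c("I", "I", "I", "II", "II"),
  year    = c(2020, 2020, 2020, 2020, 2020),
  drug    = c("A", "B", "A", "A", NA)
)

pct_drug_a <- patients |>
  group_by(country, subtype, year) |>
  summarise(
    n = n(),
    # Patients with a missing drug count as "not on Drug A" (an assumption)
    pct_drug_a = 100 * sum(drug == "A", na.rm = TRUE) / n(),
    .groups = "drop"
  )
```

Whether missing values belong in the denominator is exactly the kind of decision a Statistical Analysis Plan would need to pin down.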

The challenge 🙈

  • Write a detailed Statistical Analysis Plan without seeing any data
  • Start development with a small subset of the data
  • We can’t see the data for Germany ever

Time to chat 💬

  • Have you experienced scenarios that have led to you having no data?

  • What problems did you encounter?

Our plan

The power of statistical summaries

  • For each dataset, calculate all the summaries we might need at a granular level
  • We can combine these summaries as we like
  • Calculate stratified results from the summaries
    • Mean: \(\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_{i}\)
    • Standard deviation: \(s = \sqrt{\frac{1}{N - 1} \left( \sum_{i=1}^{N} x_{i}^2 - \frac{1}{N} \left( \sum_{i=1}^{N} x_{i} \right)^2 \right)}\)
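
A minimal sketch of the idea in base R, assuming each granular cell stores only a count, a sum and a sum of squares. The function names (`summarise_granular()`, `combine_summaries()`) and the toy ages are hypothetical:

```r
# Toy ages for two granular cells; values are illustrative only
x_fr <- c(34, 41, 29)
x_de <- c(52, 47, 38, 45)

# Hypothetical helper: reduce raw values to the three sums we need
summarise_granular <- function(x) {
  data.frame(n = length(x), sum_x = sum(x), sum_x2 = sum(x^2))
}

granular <- rbind(summarise_granular(x_fr), summarise_granular(x_de))

# Hypothetical helper: recover the mean and sd of the combined data
# from the stored sums, without access to the raw values
combine_summaries <- function(s) {
  n      <- sum(s$n)
  sum_x  <- sum(s$sum_x)
  sum_x2 <- sum(s$sum_x2)
  data.frame(
    n    = n,
    mean = sum_x / n,
    sd   = sqrt((sum_x2 - sum_x^2 / n) / (n - 1))
  )
}

pooled <- combine_summaries(granular)
```

This only works for statistics that decompose into sums. Medians and IQRs do not, so they would need a different approach.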

A small cartoon robot stood next to a huge pile of data

Develop an R package

  • Run it on the data we can see
  • Send it to Marcus
  • He sends us an .RDS
  • We can aggregate and plot as needed
devtools::install_local("describeDisease.tar.gz")
library("describeDisease")
run_analysis("path/to/german.xlsx")
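
The internals of `run_analysis()` are not shown on the slide. As a rough sketch of the round trip, assuming a hypothetical stand-in `run_analysis_sketch()` that takes a data frame instead of an .xlsx path and writes only aggregated sums to an .RDS file:

```r
library(dplyr)

# Hypothetical stand-in for the package entry point; not the real function
run_analysis_sketch <- function(data, out_path) {
  summaries <- data |>
    group_by(country, subtype) |>
    summarise(
      n        = n(),
      sum_age  = sum(age_at_diagnosis),
      sum_age2 = sum(age_at_diagnosis^2),
      .groups  = "drop"
    )
  # Only the aggregated summaries leave the secure environment
  saveRDS(summaries, out_path)
  invisible(out_path)
}

# Toy data standing in for the data we cannot see
german <- data.frame(
  country = "DE",
  subtype = c("I", "I", "II"),
  age_at_diagnosis = c(52, 47, 38)
)

rds_path <- run_analysis_sketch(german, tempfile(fileext = ".rds"))
returned <- readRDS(rds_path)
```

The returned object contains no patient-level rows, which is what makes it safe to send back.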

Cartoon people holding wrapped presents

Where to develop?

  • Data security is important
  • Client wanted controlled access and logs
  • Shared projects
  • Multiple sessions

The Posit Workbench logo

Data exploration

  • What values are unique per patient?

  • Which stratifications are viable?

  • Quarto document for data exploration and validation
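
One way to answer the first question, sketched with {dplyr} on toy data. The column names (`patient_id`, `mobility`) are assumptions for illustration:

```r
library(dplyr)

# Toy long-format data: several rows per patient
visits <- data.frame(
  patient_id = c(1, 1, 2, 2, 2),
  country    = c("FR", "FR", "UK", "UK", "UK"),
  mobility   = c("walking", "wheelchair", "walking", "walking", "walking")
)

# For each column, the largest number of distinct values seen within one patient.
# A value of 1 means the column is constant per patient.
max_distinct <- visits |>
  group_by(patient_id) |>
  summarise(across(everything(), n_distinct), .groups = "drop") |>
  summarise(across(-patient_id, max))
```

Here `country` is unique per patient but `mobility` is not, so stratifying by mobility would need a rule for which value to use.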

The Posit Workbench logo

Data validation packages 📦

What happened?

Sure, we’ll send you dummy data

Oh no

It was an XLSX worksheet

Sure, we’ll send you the schema

Database schema for a single indicator listing allowed entries

Oh no

Data structure

  • Example data collection (forms)
  • Formal database specification
  • Dummy data

Might not be available

Can’t rely on these

  • Missing data types
  • Data types not defined
    • Dates were yyyymm character
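
A minimal sketch of handling yyyymm character dates in base R. The helper name `parse_yyyymm()` is hypothetical, and mapping each month to its first day is an assumption:

```r
# Hypothetical helper: convert "yyyymm" strings to Dates (first of the month).
# Anything that does not look like a valid yyyymm value becomes NA.
parse_yyyymm <- function(x) {
  x   <- as.character(x)
  ok  <- !is.na(x) & grepl("^[0-9]{4}(0[1-9]|1[0-2])$", x)
  out <- rep(as.Date(NA), length(x))
  out[ok] <- as.Date(paste0(x[ok], "01"), format = "%Y%m%d")
  out
}

parsed <- parse_yyyymm(c("202103", "199912", "202113", NA))
```

Returning `NA` for malformed entries keeps the pipeline running on data we cannot inspect, at the cost of needing a separate count of how many values failed to parse.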

Cartoon figure saying 'Oh no'

Sure, we’ll send you validated data

Oh no

  • It wasn’t validated.
  • Patients with stop dates but no start dates
  • Patients with start & stop dates but with the drug name missing
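
Checks like these can be written as plain {dplyr} flags. This is a sketch on toy data with assumed column names, not a dedicated validation package:

```r
library(dplyr)

# Toy treatment records; column names are assumptions
treatments <- data.frame(
  patient_id = 1:4,
  drug       = c("A", NA, "B", "A"),
  start_date = as.Date(c("2020-01-01", "2020-02-01", NA, "2021-05-01")),
  stop_date  = as.Date(c("2020-06-01", "2020-03-01", "2021-01-01", NA))
)

issues <- treatments |>
  mutate(
    # Stop date recorded but no start date
    stop_without_start = !is.na(stop_date) & is.na(start_date),
    # Both dates recorded but the drug name is missing
    dates_without_drug = !is.na(start_date) & !is.na(stop_date) & is.na(drug)
  ) |>
  filter(stop_without_start | dates_without_drug)
```

Returning the flagged rows, rather than stopping with an error, gives the data owner something concrete to review.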

Whose responsibility is it?

Cartoon figure saying 'Oh no'

Okay let’s run the analysis

Oh no

Hi Rhian, I have run the code, unfortunately I get the error you can see below. Please let me know how to proceed. Thanks, Marcus

Error in `purrr::map()`:■■■■■■■■■■■■■■■■■               53% | ETA: 11s
In index: 18.
Caused by error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
Column `time_axis` is not found.

Cartoon figure saying 'Oh no'

Generating results…

Oh no

#' eval: TRUE
wb = openxlsx::createWorkbook("Results")
openxlsx::addWorksheet(wb, "Analysis by country, subtype and drug")
Cartoon figure saying 'Oh no'
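
The slide does not say what went wrong here. One plausible culprit, offered as an assumption: Excel limits worksheet names to 31 characters, and this one is longer. A base R sketch with a hypothetical helper `safe_sheet_name()`:

```r
sheet_name <- "Analysis by country, subtype and drug"
n_chars <- nchar(sheet_name)

# Hypothetical helper: truncate a name to Excel's 31-character limit
safe_sheet_name <- function(x, max_chars = 31) {
  substr(x, 1, max_chars)
}

short_name <- safe_sheet_name(sheet_name)
```

Plain truncation can produce duplicate sheet names, so a real pipeline would probably want a lookup of short, unique labels instead.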

Final run

Sure, I’ll run it right away and let you know!

Oh no

Unfortunately, I get the error below. The same error also appears when I only use the data that you already have, which is strange because I suppose that you have already tested this script on that data. I’ll try fiddling around a bit to make sure it’s not something on my side, but we can also have a chat and screenshare if that could help!

    Error in `dplyr::left_join()`:
    ! `...` must be empty.
    ✖ Problematic argument:
    • relationship = "many-to-many"

Cartoon figure saying 'Oh no'

{dplyr} version

  • We specified {dplyr} v1.1.0
  • We needed to specify {dplyr} v1.1.1
  • {renv} or Docker would have avoided this
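
Short of {renv} or Docker, a cheap guard is to fail early with a readable message. A sketch in base R, where `check_min_version()` is a hypothetical helper:

```r
# Hypothetical helper: stop with a clear message if a package is too old.
# In a package, the equivalent declaration in DESCRIPTION would be
#   Imports: dplyr (>= 1.1.1)
check_min_version <- function(pkg, min_version) {
  installed <- utils::packageVersion(pkg)
  if (installed < min_version) {
    stop(
      sprintf(
        "%s >= %s is required, but %s is installed.",
        pkg, min_version, as.character(installed)
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Example call using a package that ships with R
ok <- check_min_version("stats", "1.0.0")
```

Calling something like `check_min_version("dplyr", "1.1.1")` at the top of the analysis would have turned the `left_join()` error into an instruction the collaborator could act on.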

Diffify hex sticker a red package symbol next to a green package symbol

Tada 🎉

Table of results, mostly blank (-,-)

Faceted ggplot graph showing points and standard deviation

Client happy

  • Understood the data
  • Sees the potential
  • Better informed for the next round of studies

Useful R bits

  • {cli} makes sharing code output from collaborators easier
  • purrr::map(.progress = "Running analysis")
  • any_of() was helpful for missing columns
  • Quarto handbook for tracking assumptions and time spent
  • Leaned heavily on purrr::map2() with tidyr::nest()
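
A small sketch combining several of these on toy data. The column names are assumptions, and `time_axis` is deliberately absent to show `any_of()` skipping it:

```r
library(dplyr)
library(tidyr)
library(purrr)

results <- data.frame(
  country = c("FR", "FR", "UK", "UK", "UK"),
  age     = c(34, 41, 52, 47, 38)
)

# Columns we would like to keep; any_of() ignores the ones that are missing
wanted <- c("country", "age", "time_axis")

by_country <- results |>
  select(any_of(wanted)) |>
  # One row per country, with the remaining columns in a list column
  nest(data = -country) |>
  mutate(
    mean_age = map_dbl(data, \(d) mean(d$age), .progress = "Running analysis"),
    label    = map2_chr(country, data, \(cty, d) paste0(cty, " (n = ", nrow(d), ")"))
  )
```

With `all_of()` instead of `any_of()`, the missing `time_axis` column would raise an error, which is sometimes what you want during validation.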

In hindsight

  • Push back earlier, with evidence of the data challenges
  • Set realistic expectations
  • Use a proper database
  • Design it to let Marcus run the entire pipeline
  • Different git workflow
  • Use {renv} from the start

Questions?
